7 research outputs found

    Efficient Hardware Architectures for Accelerating Deep Neural Networks: Survey

    Recent years have seen a paradigm shift in applications of Artificial Intelligence (AI), Machine Learning (ML), and Deep Learning (DL). In particular, Deep Neural Networks (DNNs) have become central to AI applications such as computer vision, image and video processing, and robotics. With mature digital technologies and the availability of reliable data and data-handling infrastructure, DNNs have become a credible choice for solving complex real-life problems, and in certain tasks their performance and accuracy surpass those of humans. However, DNNs are computationally demanding in terms of both resources and time, and general-purpose architectures such as CPUs struggle to execute these computationally intensive algorithms efficiently. Consequently, the research community has invested considerable effort in specialized hardware architectures such as the Graphics Processing Unit (GPU), Field Programmable Gate Array (FPGA), Application Specific Integrated Circuit (ASIC), and Coarse Grained Reconfigurable Array (CGRA) for the effective implementation of such algorithms. This paper surveys research on the development and deployment of DNNs using these specialized hardware architectures and embedded AI accelerators. The review describes in detail the hardware-based accelerators used for DNN training and/or inference and compares them on factors such as power, area, and throughput. Finally, future research and development directions, including emerging trends in DNN implementation on specialized hardware accelerators, are discussed. This review article is intended to serve as a guide to hardware architectures for accelerating deep learning and improving the effectiveness of deep learning research.
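
    To make the kind of comparison mentioned above concrete, the sketch below shows how accelerators could be ranked on derived efficiency metrics such as throughput per watt and throughput per unit area. All device names and numbers are hypothetical placeholders chosen for illustration; they are not figures from the surveyed papers.

```python
# Illustrative sketch only: ranking accelerators on the efficiency metrics
# the survey mentions (power, area, throughput).
# All names and numbers below are hypothetical placeholders.
from dataclasses import dataclass

@dataclass
class Accelerator:
    name: str
    throughput_gops: float  # sustained throughput in GOPS
    power_w: float          # power consumption in watts
    area_mm2: float         # die area in mm^2

    @property
    def gops_per_watt(self) -> float:
        return self.throughput_gops / self.power_w

    @property
    def gops_per_mm2(self) -> float:
        return self.throughput_gops / self.area_mm2

candidates = [
    Accelerator("gpu_example", 4000.0, 250.0, 600.0),
    Accelerator("fpga_example", 800.0, 25.0, 300.0),
    Accelerator("asic_example", 1200.0, 5.0, 12.0),
]

# Rank by energy efficiency; area efficiency is printed alongside.
for acc in sorted(candidates, key=lambda a: a.gops_per_watt, reverse=True):
    print(f"{acc.name:14s} {acc.gops_per_watt:8.1f} GOPS/W  "
          f"{acc.gops_per_mm2:8.1f} GOPS/mm^2")
```

    Such derived metrics make it easier to compare accelerators that differ widely in absolute throughput, power budget, and silicon area.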

    Methodology for Structured Data-Path Implementation in VLSI Physical Design: A Case Study

    State-of-the-art microprocessor and domain-specific accelerator designs are dominated by data-paths composed of regular structures, also known as bit-slices. Random-logic placement and routing techniques may not yield an optimal layout for these data-path-dominated designs. Implementation tools such as Cadence’s Innovus therefore include a Structured Data-Path (SDP) feature that allows data-path placement to be fully customized by constraining the placement engine, with the constraints supplied to the tool through a relative placement file. However, the tool neither extracts nor automatically places the regular data-path structures; in other words, the relative placement file is not generated automatically. In this paper, we propose a semi-automated method for extracting bit-slices and generating the relative placement constraints for the Innovus SDP flow. The proposed method is demonstrated to achieve 17% lower placement density (utilization) for a pixel buffer design, while the other performance metrics remain unchanged compared to the traditional place-and-route flow.
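
    The following sketch illustrates, under stated assumptions, the kind of semi-automated extraction step described above: bit-sliced instances are grouped by a naming pattern into a row/column grid that could then be turned into relative placement constraints. The instance names, the regular expression, and the printed output format are illustrative assumptions only; they do not reproduce the actual Cadence Innovus relative placement file syntax or the authors' method.

```python
# Minimal sketch of the idea behind semi-automated relative-placement
# generation: group bit-sliced instances by a naming pattern and emit a
# row/column grid that a placement engine could be constrained with.
# Instance names, regex, and output format are illustrative assumptions,
# NOT the Cadence Innovus relative placement file syntax.
import re
from collections import defaultdict

# Hypothetical flat netlist instance names of the form <stage>_<bit_index>.
instances = [
    "pixbuf_reg_0", "pixbuf_reg_1", "pixbuf_reg_2", "pixbuf_reg_3",
    "pixbuf_mux_0", "pixbuf_mux_1", "pixbuf_mux_2", "pixbuf_mux_3",
    "pixbuf_add_0", "pixbuf_add_1", "pixbuf_add_2", "pixbuf_add_3",
]

pattern = re.compile(r"^(?P<stage>\w+)_(?P<bit>\d+)$")

# Group instances into bit-slices: one column per data-path stage,
# one row per bit index.
grid = defaultdict(dict)
for inst in instances:
    m = pattern.match(inst)
    if m:
        grid[int(m.group("bit"))][m.group("stage")] = inst

stages = sorted({stage for row in grid.values() for stage in row})
print("# illustrative relative-placement grid (row = bit, column = stage)")
for bit in sorted(grid):
    cells = [grid[bit].get(stage, "-") for stage in stages]
    print(f"row {bit}: " + "  ".join(cells))
```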

    Code Generation for Tightly Coupled Processor Arrays (Codegenerierung für eng gekoppelte Prozessorfelder)

    In this dissertation, we consider techniques for automatic code generation and code optimization of loop programs for programmable tightly coupled processor array targets. These arrays consist of interconnected small, lightweight very long instruction word (VLIW) cores, which can exploit both loop-level and instruction-level parallelism. They are well suited for executing compute-intensive nested loop applications, often providing higher power and area efficiency than commercial off-the-shelf processors, and are ideal candidates for accelerating nested loop programs in future heterogeneous systems, where energy efficiency is one of the most important goals of overall system-on-chip design. In order to harness the full compute potential of such an array, we need efficient compiler techniques that can automatically map nested loop programs onto it. Such a compiler framework is essential for increasing designer productivity and for shortening development cycles. In this context, this dissertation proposes a novel code generation and compaction approach that generates the assembly-level code for all processing elements in an array from a scheduled loop nest. The code generation approach itself is independent of the array size, preserves the given schedule, and is independent of the problem size. As part of this compiler framework, we also present a scalable interconnect generation approach in which the connections among the processing elements are automatically generated from the same scheduled loop program. Furthermore, we consider the integration of a tightly coupled processor array into a multi-processor system-on-chip. Here, we propose the design of new hardware components such as a global controller, which generates control signals to orchestrate (synchronize) the programs running on the different processing elements, and address generators, which produce the address and enable signals for a set of reconfigurable I/O buffers surrounding the processor array. We propose a fully programmable design of these hardware components and add the compiler support required to generate their configuration data from the same scheduled loop program. In summary, the major contributions of this dissertation enable and ease the fully automated mapping of nested loop programs onto tightly coupled processor arrays.
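
    As a rough illustration of one ingredient of such a mapping, the toy sketch below block-partitions a two-dimensional iteration space across a grid of processing elements so that each PE executes a rectangular tile. This is only an assumed, simplified view of iteration-space partitioning; it does not reproduce the scheduling, code generation, or compaction techniques developed in the dissertation.

```python
# Toy sketch: block-partition a 2D loop nest across a grid of processing
# elements (PEs) so each PE owns a rectangular tile of iterations.
# This is an illustrative simplification, not the dissertation's approach.

def partition_iteration_space(n_i, n_j, pe_rows, pe_cols):
    """Assign each iteration (i, j) of an n_i x n_j loop nest to a PE."""
    tile_i = -(-n_i // pe_rows)  # ceiling division
    tile_j = -(-n_j // pe_cols)
    mapping = {}
    for i in range(n_i):
        for j in range(n_j):
            mapping[(i, j)] = (i // tile_i, j // tile_j)
    return mapping

# Example: an 8x8 loop nest mapped onto a 2x2 processor array.
mapping = partition_iteration_space(8, 8, 2, 2)
for pe in sorted(set(mapping.values())):
    count = sum(1 for target in mapping.values() if target == pe)
    print(f"PE{pe}: {count} iterations")
```

    In the dissertation itself, the per-PE programs, the interconnect configuration, and the controller and address-generator configurations are all derived from the same scheduled loop program rather than from a simple tiling like this.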